ROSS TECHNOLOGY, INC.


WHITE PAPER
hyperSPARC: The Next-Generation SPARC


Introduction 





Several years ago, ROSS Technology set itself a goal: to develop the highest-performance
microprocessor in the industry. This microprocessor had to have the following: a high
degree of manufacturability, upgradability from prior generation SPARC CPU modules, and
100% binary compatibility with existing SPARC software.





























The result of this ambitious endeavor is hyperSPARC™. The hyperSPARC program achieved
every aspect of its stated objectives:





• 3 to 5 times the performance of prior generation SPARC implementations

• Completely SPARC compatible (Version 8 Architecture, Reference Memory Management
Unit (MMU), Level 2 MBus)

• Manufactured using proven CMOS technology and offered in Multi-Die Packaging
(MDP) form

• Implemented as a SPARC standard MBus module and interchangeable with existing
modules








But hyperSPARC is more than a new microprocessor: hyperSPARC is a milestone. It enables
the manufacturers of SPARC systems and boards to continue their market-leading
price/performance pace, while providing software writers with a vehicle for exploiting
multilevel parallel processing. hyperSPARC's architecture allows it to be used in a wide range
of machines, from mainframes to minicomputers, servers to desktops, laptops to notebooks,
making its impact in the SPARC world significant.























General Description of Product 


hyperSPARC is designed as a tightly coupled chip set and implemented as a SPARC MBus
module using Multi-Die Packaging (MDP). Each hyperSPARC CPU supports either 256, 512, or
1024 Kbytes of second-level cache, and each module contains one or two CPUs. The chip
set is comprised of the RT620 Central Processing Unit (CPU), the RT625 or RT626 Cache
Controller, Memory Management, and Tag Unit (CMTU), and four RT627 Cache Data Units
(CDUs) for 256 Kbytes of second-level cache, four RT628 CDUs for 512 Kbytes of
second-level cache, or eight RT628 CDUs for 1 Mbyte of second-level cache. The chip set can be
configured for uniprocessing (Level 1 MBus) or multiprocessing (Level 2 MBus). Figure 1 is a
block diagram of the hyperSPARC chip set.

Figure 1. hyperSPARC Chipset


The RT620 is the primary processing unit in hyperSPARC. This chip is comprised of an
integer unit, a floating-point unit, and an 8-Kbyte, two-way, set-associative instruction cache.
The integer unit contains the ALU and a separate load/store data path, constituting two of
the chip's four execution units. The RT620 also includes the floating-point unit and a
branch/call unit (for processing control transfer instructions). Two instructions are fetched
every clock cycle. In general, as long as these two instructions require different execution
units and have no data dependencies, they can be launched simultaneously. (It is also
possible to fetch and dispatch two floating-point adds or two floating-point multiplies at a
time.) The RT620 contains two register files: 136 integer registers configured as 8 register
windows, and 32 separate floating-point registers in the floating-point unit (see Figure 1).
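
The launch rule in the paragraph above can be sketched in a few lines of Python. This is only an illustrative model, not ROSS's scheduler logic: the `Instr` record, the unit labels, and the register-name strings are assumptions made for the example, and only a read-after-write dependency between the two instructions is checked.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    unit: str                       # "load_store", "branch_call", "integer" or "float"
    dest: Optional[str] = None      # register written, if any
    srcs: Tuple[str, ...] = ()      # registers read
    fp_op: Optional[str] = None     # "add" or "mul" for floating-point operations


def can_launch_together(first: Instr, second: Instr) -> bool:
    """Return True if the fetched pair may be launched in the same cycle."""
    # Data dependency: the second instruction reads what the first writes.
    if first.dest is not None and first.dest in second.srcs:
        return False
    # Different execution units: no resource conflict.
    if first.unit != second.unit:
        return True
    # Same unit: the text allows two FP adds or two FP multiplies.
    if first.unit == "float":
        return first.fp_op == second.fp_op and first.fp_op in ("add", "mul")
    return False
```

For instance, an integer add paired with an independent load passes the check, while the same add paired with a load that reads the add's result does not.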





















































hyperSPARC's second-level cache is built around the RT625 or RT626 CMTU, a combined
cache controller and memory management unit that supports shared-memory and symmetric
multiprocessing. The RT625 cache controller portion supports 256 Kbytes of cache,
made up of four RT627 CDUs. The RT626 CMTU supports 512 Kbytes or 1 Megabyte of cache
(four or eight RT628 CDUs, respectively). The cache is direct mapped with 4K tags (RT625)
or 16K tags (RT626). The cache is physically tagged and virtually indexed so that the
CMTU's cache coherency logic can quickly determine snoop hits and misses without stalling
the RT620's access to the cache. Both copy-back and write-through caching modes are
supported.









































Figure 2. RT620 CPU - Major Functional Blocks


The MMU is a SPARC Reference MMU with a 64-entry, fully associative Translation
Lookaside Buffer (TLB) that supports 4096 contexts. The RT625 contains a read buffer (32
bytes deep) and a write buffer (64 bytes deep) for buffering the 32-byte cache lines in and
out of the second-level cache. It also contains synchronization logic for interfacing the
virtual Intra-Module Bus (IMB) to the SPARC MBus for asynchronous operation (see Figure 3).

















The RT627 is a 16K x 32 SRAM that is custom-designed for hyperSPARC's cache requirements
(256-Kbyte configuration). It is organized as four arrays of 16-Kbyte static memory with
byte-write logic, registered inputs, and data-in and data-out latches (see Figure 4). The
RT628, used in 512-Kbyte and 1-Mbyte cache versions, is organized as four arrays of 32
Kbytes each. The RT627 and RT628 provide a zero-wait-state cache to the CPU with no
pipeline penalty (i.e., stalls) for loads and stores that hit the cache. The RT627 is designed
specifically for hyperSPARC, so it doesn't require glue logic for interfacing to the RT620
CPU and the RT625 CMTU. The RT628 requires no glue logic to interface to the RT620
and RT626.
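
The three cache sizes follow directly from the number and size of the Cache Data Units. The short calculation below is a sketch of that arithmetic; it assumes four arrays of 16 Kbytes per RT627 and four arrays of 32 Kbytes per RT628, and the constant and function names are made up for the example.

```python
ARRAYS_PER_CDU = 4

# Assumed per-array sizes in Kbytes for each Cache Data Unit type.
ARRAY_KBYTES = {"RT627": 16, "RT628": 32}


def cdu_kbytes(part: str) -> int:
    """Capacity of a single Cache Data Unit in Kbytes."""
    return ARRAYS_PER_CDU * ARRAY_KBYTES[part]


def cache_kbytes(part: str, cdu_count: int) -> int:
    """Total second-level cache built from cdu_count identical CDUs."""
    return cdu_kbytes(part) * cdu_count


print(cache_kbytes("RT627", 4))  # 256
print(cache_kbytes("RT628", 4))  # 512
print(cache_kbytes("RT628", 8))  # 1024
```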
























































The microarchitecture of hyperSPARC boasts classic RISC and superscalar features for
improving instruction processing throughput. hyperSPARC also employs architectural
features that differentiate it from other next-generation microprocessor designs. The
following sections highlight some of hyperSPARC's most important attributes.

Figure 3. RT625 CMTU


Multilevel Parallelism 


hyperSPARC's high performance is mainly due to its parallel program execution, which is
based on the idea that software tasks can be dissected into pieces at several levels, and
can run concurrently. Hardware can be designed to take advantage of the parallelism
offered by software. hyperSPARC was designed with this in mind, taking advantage of the
nature of software.





At the top level of hyperSPARC's parallel processing model is the industry's most efficient
use of shared-memory multiprocessing support. This VLSI hardware support for connecting
multiple CPUs provides a cost-effective solution for creating a tightly coupled
multiprocessing system. Combining this configuration with a symmetric, multi-threading operating
system provides users with a powerful computing node that has many times the
performance of a single-CPU system.
































Applying the parallel processing model to the microprocessor leads to today's most
sophisticated approach to microprocessor architecture: superscalar. Just as multiple CPUs can
work in parallel to tackle several tasks at the same time, the multiple processing elements
within each of these CPUs can simultaneously execute several instructions. This concept
of duplicating pipeline and other processing logic within the IC, which allows multiple
instructions to be fetched and launched, is well suited to the simpler RISC design
structures.

Figure 4. RT627/RT628 Logic Block Diagram


The combination of parallel program execution at both the process and instruction level is
at the heart of the hyperSPARC architecture. The highly pipelined processing paths for
both the integer and floating-point units add additional processing power to the
hyperSPARC architecture.











High Frequency of Operation 





Fundamentally, hyperSPARC is built for speed. In order to facilitate high clock frequencies,
particular attention is paid to the five-stage integer and floating-point pipelines, keeping
them simple and well-balanced. Each stage of the pipeline is partitioned so that
the number of gates per stage is similar, making it easier to do process shrinks for scaling
to higher clock rates. In addition, only a single rising clock edge is used to propagate
instructions through the pipeline stages, because using both clock edges creates more
complex timing issues, especially for critical paths.






























































hyperSPARC's performance scales independently of the external bus (MBus). The
hyperSPARC chip set is partitioned to allow synchronous or asynchronous operation
through synchronization logic contained in the RT625 and RT626. This decoupling of the
CPU bus from the external bus allows scaling of hyperSPARC's clock frequency independent
of the memory and I/O subsystems. This provides for longer product life cycles because
upgrades to higher-performance hyperSPARC modules do not require hardware changes to
the system.

Figure 5. Floating-Point Instruction Launching and Its Effect on the Integer Pipeline


Instruction Scheduling 





Instruction scheduling and dispatching is a critical portion of any superscalar design.
Optimal instruction scheduling includes minimizing both pipeline stalls and
conditions that prevent simultaneous instruction launching. These problems are costly for
any RISC machine, but even more so when multiple instructions are being held.









































All superscalar processors are not the same. The ability to fetch and launch multiple
instructions is only a valuable asset if it is frequently used. Compilers can help reduce the
occurrences of instructions that can't be launched together by being aware of hardware
implementation. However, the software can only optimize for the hardware; it is still the
job of the microprocessor to reduce the occurrence of sequential instruction launches.






































hyperSPARC is partitioned into four execution units in order to facilitate parallel
processing of major instruction types. These execution units are the load/store unit, branch/call
unit, integer unit, and floating-point unit. The floating-point unit is comprised of a
queue and two parallel pipelines: an adder and a multiplier.




















hyperSPARC fetches two instructions every clock cycle and evaluates them for
simultaneous launch. hyperSPARC's primary scheduler is invoked for this evaluation and determines
the hardware resources required for processing the instructions, as well as any data
dependencies. This critical juncture in the instruction processing path exposes the
strengths and weaknesses of the microarchitecture.























Poorly architected designs require more frequent splits of instruction groupings. Internal
constraints, such as bus design/bandwidth and number of read/write register ports,
sometimes make it impossible to schedule multiple-instruction launches. These design
constraints manifest themselves in many ways, most often by restricting simultaneous
launches based on the types of instructions, and/or order of the instructions, within groupings.
The result is frequent sequential (instead of simultaneous) instruction launches.








hyperSPARC's ability to launch multiple instructions simultaneously is not restricted by the
order and type of instructions within groupings. Sequential launch is required when there
are resource conflicts or data dependencies. But, unlike other superscalar designs, any
instruction can occupy any position in the grouping and still be considered for
simultaneous launch.























hyperSPARC also provides special support for launching floating-point instructions. The
hyperSPARC floating-point unit employs two queues: a pre-queue and a post-queue.








The post-queue maintains information on instructions currently in execution in either the
floating-point adder or multiplier units. This information includes the instruction type,
address, and stage of the pipeline for any given clock, and it is required for exception
handling to recover the instructions aborted when the floating-point pipeline is flushed due to
a trap. This principle is extended to a floating-point pre-queue, which holds the same
information for up to four floating-point instructions that are
pending execution (a floating-point buffer of sorts). The significance of this pre-queue is
that it allows floating-point instructions to be sent off to the pre-queue from the normal
integer (i.e., load/store) instruction stream. The hyperSPARC scheduler is capable of
fetching and dispatching any two floating-point instructions at a time, sending both to the
pre-queue in the same clock cycle. (If the floating-point unit is not busy, one of the instructions
actually bypasses the pre-queue and begins final decode and execution immediately; the
other is placed in the pre-queue.) The integer pipeline proceeds uninterrupted to fetch,
decode, and execute more instructions in the next clock cycle. For example, in Figure 5,
the FP MUL, which requires the multiplier occupied by the FP SQRT, is offloaded to the FP
pre-queue so that the integer unit can continue instruction fetching and launching. This is
made possible by hyperSPARC's dual-level instruction decoding, which offloads final
instruction decode to the floating-point unit itself.
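
The pre-queue behavior described above can be pictured with a small Python model. It is a deliberately simplified sketch: the class name and methods are invented for illustration, the adder and multiplier are collapsed into a single busy resource, and the queue depth of four is taken from the "up to four instructions" wording rather than from a data sheet.

```python
from collections import deque

PREQUEUE_DEPTH = 4  # assumed from "up to four instructions"


class FpUnitModel:
    """Toy model of dispatch into a floating-point pre-queue."""

    def __init__(self) -> None:
        self.prequeue = deque()
        self.executing = None  # instruction currently occupying the unit

    def dispatch(self, instrs) -> bool:
        """Accept up to two FP instructions in one cycle.

        Returns False, changing nothing, if the pre-queue lacks room.
        """
        pending = list(instrs)
        if len(pending) > 2:
            raise ValueError("at most two instructions per cycle")
        # Bypass only when the unit is idle and nothing is queued ahead.
        bypass = bool(pending) and self.executing is None and not self.prequeue
        to_queue = pending[1:] if bypass else pending
        if len(self.prequeue) + len(to_queue) > PREQUEUE_DEPTH:
            return False
        if bypass:
            self.executing = pending[0]  # skips the pre-queue entirely
        self.prequeue.extend(to_queue)
        return True

    def complete(self):
        """Retire the executing instruction and start the next queued one."""
        done = self.executing
        self.executing = self.prequeue.popleft() if self.prequeue else None
        return done
```

In this model the integer side never waits on the floating-point unit unless `dispatch` returns False, which mirrors the point that the integer pipeline keeps fetching while floating-point work drains from the pre-queue.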




























































































Integer multiplication/division is new to SPARC (Version 8). Some next-generation CPUs
offload this integer processing to the floating-point unit. This results in more contention
for this execution unit, further limiting the cases in which simultaneous instruction launch
is possible, and imposing an added burden on the compiler and assembly-level
programmers who wish to write software for optimum performance. In hyperSPARC, integer
multiplies and divides are executed in the integer ALU, removing this workload from the
floating-point unit. As more and more applications take advantage of these functions,
hyperSPARC will not compromise floating-point performance.
Multiprocessing


Multiprocessing is the key to dramatically higher performance from existing silicon
technology. hyperSPARC provides a glueless, standard interface for tightly coupled
multiprocessing.


Snoop Mechanism














hyperSPARC provides a high-performance snoop mechanism to facilitate efficient data
transfers between processors. In a write-invalidate protocol, such as the one implemented
in Level 2 MBus, caches residing on a shared bus must check or "snoop" each address
request to cached memory space. If a cache owns the cache line that has been
requested, it can respond to the request by copying the data to memory (which will later
forward the data to the requesting cache) or it can supply the data directly to the
requesting processor (called direct data intervention). In the case of a direct data intervention
transfer, the cache supplying the data must prevent memory from obtaining the bus and
responding to the request.
































The SPARC architecture allows a window of MBus clock cycles within which a cache must
assert the Memory Inhibit (MIH) signal if it owns the requested cache line (i.e., there is a
snoop hit). That window is A + 2 cycles to A + 7 cycles, with "A" representing the cycle in
which the address of the cache line being requested is placed on the MBus. hyperSPARC,
implementing this second-generation multiprocessing design, responds on snoop hits with
MIH in the A + 3 cycle. This means that memory is free to respond beginning at A + 4. Using
the full window allowed by MBus would impose a three-cycle penalty for every memory
access. Responding quickly, even though the MBus specification offers more relaxed
timing, enables a very high-performance memory subsystem to be built around hyperSPARC.
























































Cache Architecture 


hyperSPARC is designed as a chip set to take advantage of the fact that processors and
caches perform best when they are tightly coupled. Our designers' understanding of the
CPU's relationship with the cache is demonstrated in the RT620's design, which imposes
only a one-cycle primary cache miss penalty. A pipeline stage is allotted in the RT620 for
accessing the second-level cache so that no stall in CPU throughput is realized if the
on-chip cache is missed and the second-level cache is hit.

















The RT620's pipeline is a five-stage pipeline, as shown in Figure 6. The fourth stage of the
pipeline is the Cache stage, which is a built-in recognition of the latency of accessing the
second-level cache. Load and Store instructions cause the RT620 to initiate two accesses:
one to the on-chip 8-Kbyte instruction cache and, at the same time, one to the
second-level cache. If the address for the instruction is found within the on-chip cache, the access
to the second-level cache is canceled and the instruction is available at the Decode stage
of the pipeline. If there is a miss in the internal cache, and a hit in the second-level
cache, the instruction will be available after a one-cycle miss penalty that is built into the
pipeline. The significance of this design is that it allows the pipeline to proceed
uninterrupted as long as the instruction accesses hit either the on-chip cache or the second-level
cache, which is [illegible]% and 9[illegible]% of the time, respectively, for typical workstation
applications. Since the integer and floating-point pipelines mirror these five stages for reasons of
architectural balance and ease of exception handling, this design enables the RT620 to achieve
its high throughput rate at speeds that would not otherwise be possible.


Figure 6. The hyperSPARC Pipeline (stages: Fetch, Decode, Execute, Cache, Write)























Pipelining and Data Forwarding 





The RT627 and RT628 cache data units (CDUs) borrowed from microprocessor architecture
by implementing data forwarding and a unique single-stage pipeline. This pipeline design
allows the CDUs to keep up with the clock rates of the processor, which includes latching
the data and writing it into the RAM core within a period shorter than 10 ns. As clock rates
escalate, this task becomes increasingly difficult for an SRAM without moving to smaller ...























For writes into the CDUs, however, only the address and data are latched during the write
cycle. The RT620 is then free to continue normal instruction processing. The most
time-consuming portion of the transaction, the write into the RAM core, is delayed until the
next write access, where there will be a guaranteed cycle available for this action.




















The obvious drawback of this implementation is the possibility of a read of the data being
held in the latches before the RAM core is updated. The RT627 and RT628 address this
using data forwarding. A comparator incorporated in the CDU compares the address of a
pending write with the incoming read address. If a match occurs, data will be forwarded
from the input data latches directly to output pins, bypassing the RAM core. In this way,
the most recent data is always provided by the CDUs.
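
The delayed write and the forwarding comparator can be illustrated with a short behavioral model. This is a sketch under simplifying assumptions, not a description of the actual circuit: the class and method names are invented, the RAM core is a Python dictionary, and unwritten locations are assumed to read as zero.

```python
class CduModel:
    """Toy model of a late-write SRAM with read forwarding."""

    def __init__(self) -> None:
        self.ram = {}        # stands in for the RAM core
        self.pending = None  # (address, data) latched by the most recent write

    def write(self, address: int, data: int) -> None:
        # The previously latched write is committed to the RAM core now,
        # during the cycle that the next write access guarantees.
        if self.pending is not None:
            prev_address, prev_data = self.pending
            self.ram[prev_address] = prev_data
        # Only the new address and data are latched in this cycle.
        self.pending = (address, data)

    def read(self, address: int) -> int:
        # Comparator: does the read address match the pending write?
        if self.pending is not None and self.pending[0] == address:
            return self.pending[1]        # forwarded from the input latches
        return self.ram.get(address, 0)   # assumed default for unwritten cells
```

A read issued immediately after a write to the same address returns the new value even though the dictionary standing in for the RAM core has not been updated yet.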























Special hyperSPARC Features 


There are a number of subtle but clever design features implemented in hyperSPARC that
improve performance for common CPU functions. One such feature is the RT620 CPU's
Fast Constant/Index/Branch capability.

Fast Constant and Fast Index


Fast Branch 


Fast Constant, for example, represents a commonly occurring combination of two ALU
instructions that are used to generate 32-bit constants. Specifically, the SETHI and OR
instruction pair is used frequently to create the 22 high-order and 10 low-order bits,
respectively. (In fact, the current SPARC compilers generate these two instructions from
the pseudo-instruction SET for sufficiently large constants.) When hyperSPARC's scheduler
encounters these two instructions used for generating a 32-bit constant, it launches them
for execution in parallel, as if they were a single instruction. Thus, an operation that
normally takes two cycles (i.e., the setting of the high- and low-order bits for the designated
register) is executed in one cycle. Fast Index works similarly, combining the SETHI and LD
instructions used to generate a 32-bit base address for array indexing.
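
The split that the SETHI and OR pair performs can be shown with plain integer arithmetic. The sketch below models only the bit manipulation (22 high-order bits, then 10 low-order bits); the function names are illustrative and nothing here represents the scheduler's pairing logic.

```python
MASK32 = 0xFFFFFFFF


def split_constant(value: int):
    """Split a 32-bit constant into the SETHI and OR immediates."""
    value &= MASK32
    hi22 = value >> 10    # upper 22 bits, loaded by SETHI
    lo10 = value & 0x3FF  # lower 10 bits, merged by OR
    return hi22, lo10


def sethi(imm22: int) -> int:
    """SETHI places a 22-bit immediate in bits 31..10 and clears bits 9..0."""
    return (imm22 << 10) & MASK32


def build_constant(value: int) -> int:
    """Rebuild the constant the way the two-instruction sequence would."""
    hi22, lo10 = split_constant(value)
    reg = sethi(hi22)  # first instruction of the pair
    reg |= lo10        # second instruction of the pair
    return reg


print(hex(build_constant(0xDEADBEEF)))  # 0xdeadbeef
```

Because the OR depends on the register written by SETHI, a conventional scheduler would issue the two instructions in successive cycles; Fast Constant treats the pair as one operation.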
















































































Fast Branch is a feature that eliminates waiting for the outcome of a condition-code-setting
ALU instruction in order to initiate a branch target fetch, so that a branch and associated
ALU instruction can be launched simultaneously. This occurs when an integer branch is
immediately preceded by an ALUcc instruction type in the instruction packet. In such
cases, the scheduler uses a branch prediction strategy to launch both instructions
simultaneously (hyperSPARC predicts branch taken). This reduces the number of cycles between
branch resolution and either (1) target instruction fetch and execute, if the branch is taken,
or (2) continued instruction processing, if the branch is not taken.

















Block Copy and Block Fill 


Block Copy/Block Fill are special features of the RT625 and RT626 CMTU. These are
software-initiated operations (using the STA instructions with special ASIs) to increase the
performance of data movement in and out of main memory. Taking full advantage of the
RT625 and RT626's read and write buffers, these block manipulation functions allow data
to move to and from main memory without having to bring it into cache. This not only
saves the latency of filling the cache, it also allows the RT620 to continue processing at
the same time.






















Block Copy copies an entire 32-byte block of data from a cache or main memory location to
another location in main memory. This is particularly useful when copying files, databases,
or other large memory blocks to other memory locations. Although the transaction varies
depending on whether the data to be moved is in cache and if it is cacheable or
non-cacheable, the basic principle is the same. If the data is being copied from one location to
another in main memory, for example, it is first read into the read buffer and then
transferred to the write buffer, to be written to the specified memory location. While the RT620
must be held during the read of the 32-byte line, it is released to continue processing
during the RT625 or RT626's write back to main memory. This represents the realization that
to bring this block into cache is superfluous if no operations are being performed on it.
Block Copy saves more than 10 clock cycles that would be encountered if the block were
read into, and then out of, cache.

















































Block Fill copies into the specified memory location the doubleword embedded in the
special Block Fill STA instruction. Block Fill works similarly to Block Copy, except that
the specified doubleword pattern is written throughout the 32-byte block of memory. It is
very useful in initializing large blocks of memory. If the Block Fill feature were not
available, this transaction would require the cache line in memory that is to be written to be
brought into cache, modified with a series of instructions, and then written back out to main memory.
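
The effect of the two block operations on memory can be sketched in Python. This models only the data movement, not the STA/ASI encoding or the CMTU buffers themselves. It assumes 32-byte blocks, an 8-byte doubleword, and 32-byte alignment; the function names are hypothetical.

```python
BLOCK_BYTES = 32       # assumed block size
DOUBLEWORD_BYTES = 8   # a SPARC doubleword is 64 bits


def _check_aligned(address: int) -> None:
    # Alignment requirement is an assumption made for this sketch.
    if address % BLOCK_BYTES != 0:
        raise ValueError("address must be 32-byte aligned")


def block_fill(memory: bytearray, address: int, doubleword: bytes) -> None:
    """Write one doubleword pattern across a whole 32-byte block."""
    if len(doubleword) != DOUBLEWORD_BYTES:
        raise ValueError("pattern must be exactly 8 bytes")
    _check_aligned(address)
    repeats = BLOCK_BYTES // DOUBLEWORD_BYTES
    memory[address:address + BLOCK_BYTES] = doubleword * repeats


def block_copy(memory: bytearray, src: int, dst: int) -> None:
    """Copy one 32-byte block via a stand-in for the read/write buffers."""
    _check_aligned(src)
    _check_aligned(dst)
    buffer = bytes(memory[src:src + BLOCK_BYTES])  # read buffer
    memory[dst:dst + BLOCK_BYTES] = buffer         # drained by the write buffer
```

Neither function touches a cache model, which is the point of the hardware feature: the block moves between memory locations without occupying a cache line.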

















Manufacturing 


From the beginning, hyperSPARC was designed for cost-effective manufacturability. The
goal was to make small dice (for cost considerations) behave as if they were one chip (for
the performance resulting from high integration). The chip set was partitioned to do just
that. A well-architected design like hyperSPARC can remove interchip delays from critical
paths so that the chip set performs as well as a single chip, but without the manufacturing
and testing problems of a huge monolithic die.




















hyperSPARC was designed with very manageable die sizes, the largest (RT620) containing fewer
than 1.5 million transistors (the RT62[illegible] has about [illegible] transistors).





About ROSS Technology... 


ROSS was incorporated in August 1988 and is an affiliate of Fujitsu, Ltd. Functioning
autonomously within the Fujitsu corporate umbrella, ROSS is fully responsible for all
operational aspects of its SPARC program. Our objective is to drive SPARC, the industry's
dominant RISC architecture, to increased market share throughout the 1990's. We will
accomplish this by continuing to produce the world's most architecturally advanced SPARC
microprocessor products, implemented in world-class CMOS technology, and spanning the full
performance and price range of computer applications.